EDA

Read the data

Code
library(tidyr)
library(dplyr)
data <- read.csv("../data/derived-data/PFS_Merged_Cleaned_factored.csv")
head(data)
  X         STUDYID  USUBJID CNSR  TRT01P PPROTFL    STARTDT        ADT AVAL
1 1 54767414MMY3008 SUBJ0001    0 TREAT A       Y 2019-04-22 2021-01-12  631
2 2 54767414MMY3008 SUBJ0002    0 TREAT A       Y 2021-10-13 2024-04-04  904
3 3 54767414MMY3008 SUBJ0003    0 TREAT B       Y 2020-10-31 2020-11-30   30
4 4 54767414MMY3008 SUBJ0004    1 TREAT A       Y 2020-12-31 2021-09-25  268
5 5 54767414MMY3008 SUBJ0005    1 TREAT A       Y 2017-10-26 2018-07-13  260
6 6 54767414MMY3008 SUBJ0006    0 TREAT B       Y 2017-02-02 2017-12-13  314
          EVNTDESC            RSSTRESC TOTAE SAECOUNT AEDECOD_MS
1            DEATH      Stable Disease     1        0     Nausea
2            DEATH Progressive Disease     2        0    Fatigue
3            DEATH    Partial Response    NA       NA       <NA>
4 ANALYSIS CUT OFF Progressive Disease     4        1  Dizziness
5     LAST CONTACT      Stable Disease     4        1    Fatigue
6            DEATH   Complete Response     2        0   Headache
           AEBODSYS AESER      AGE    SEX  RACE                 ETHNIC      LDH
1  Gastrointestinal     N 70.51563 Female White Not Hispanic or Latino 139.7219
2    Nervous System     N 79.40260 Female White Not Hispanic or Latino 115.6110
3              <NA>  <NA> 76.33235   Male White Not Hispanic or Latino 172.8269
4  Gastrointestinal     N 77.44750   Male White Not Hispanic or Latino 144.8748
5    Nervous System     N 72.06439 Female White Not Hispanic or Latino 161.9881
6 General Disorders     N 65.23182   Male White Not Hispanic or Latino 175.1388
       ALB      HGB         EXTRT EXDOSE      BMI ECOG       CRP
1 155.2226 135.7905 Dexamethasone    110 26.83260    4 12.830053
2 128.2108 131.7716   Daratumumab     30 25.42065    1  6.702462
3 159.3242 142.5922 Dexamethasone     60 32.02936    2 11.959545
4 171.4453 129.9159  Lenalidomide    150 22.04537    1  2.101916
5 131.3127 164.3331   Daratumumab    130 20.38940    1  7.230141
6 141.2525 159.1366  Lenalidomide     20 21.33275    1  6.671318

First we want to understand the struture of the data and the data type of each variables

Code
# load library
# Load necessary libraries
dim(data)
[1] 683  28
Code
str(data)
'data.frame':   683 obs. of  28 variables:
 $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ STUDYID   : chr  "54767414MMY3008" "54767414MMY3008" "54767414MMY3008" "54767414MMY3008" ...
 $ USUBJID   : chr  "SUBJ0001" "SUBJ0002" "SUBJ0003" "SUBJ0004" ...
 $ CNSR      : int  0 0 0 1 1 0 0 0 1 0 ...
 $ TRT01P    : chr  "TREAT A" "TREAT A" "TREAT B" "TREAT A" ...
 $ PPROTFL   : chr  "Y" "Y" "Y" "Y" ...
 $ STARTDT   : chr  "2019-04-22" "2021-10-13" "2020-10-31" "2020-12-31" ...
 $ ADT       : chr  "2021-01-12" "2024-04-04" "2020-11-30" "2021-09-25" ...
 $ AVAL      : int  631 904 30 268 260 314 347 691 180 296 ...
 $ EVNTDESC  : chr  "DEATH" "DEATH" "DEATH" "ANALYSIS CUT OFF" ...
 $ RSSTRESC  : chr  "Stable Disease" "Progressive Disease" "Partial Response" "Progressive Disease" ...
 $ TOTAE     : int  1 2 NA 4 4 2 1 4 2 4 ...
 $ SAECOUNT  : int  0 0 NA 1 1 0 0 2 1 0 ...
 $ AEDECOD_MS: chr  "Nausea" "Fatigue" NA "Dizziness" ...
 $ AEBODSYS  : chr  "Gastrointestinal" "Nervous System" NA "Gastrointestinal" ...
 $ AESER     : chr  "N" "N" NA "N" ...
 $ AGE       : num  70.5 79.4 76.3 77.4 72.1 ...
 $ SEX       : chr  "Female" "Female" "Male" "Male" ...
 $ RACE      : chr  "White" "White" "White" "White" ...
 $ ETHNIC    : chr  "Not Hispanic or Latino" "Not Hispanic or Latino" "Not Hispanic or Latino" "Not Hispanic or Latino" ...
 $ LDH       : num  140 116 173 145 162 ...
 $ ALB       : num  155 128 159 171 131 ...
 $ HGB       : num  136 132 143 130 164 ...
 $ EXTRT     : chr  "Dexamethasone" "Daratumumab" "Dexamethasone" "Lenalidomide" ...
 $ EXDOSE    : int  110 30 60 150 130 20 170 80 20 140 ...
 $ BMI       : num  26.8 25.4 32 22 20.4 ...
 $ ECOG      : int  4 1 2 1 1 1 4 0 2 1 ...
 $ CRP       : num  12.83 6.7 11.96 2.1 7.23 ...

Since our project is about Survival Analysis, we need to check on the CNSR column, where 0 means Death and 1 means Censoring

Code
library(ggplot2)
ggplot(data, aes(x = factor(CNSR))) +
  geom_bar(fill = "steelblue", color = "black") +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, size = 3) +
  labs(
    title = "Bar Plot of CNSR Column",
    x = "Censoring Status (0 = Death, 1 = Censored)",
    y = "Count"
  ) +theme_minimal()

We are also interested in the other categorical variables, such as demographic information and Adverse Event related variables.

Code
categorical_columns <- c("SEX", "RACE", "ETHNIC","EXTRT", "RSSTRESC", "AEDECOD_MS", "AEBODSYS", "AESER", "TRT01P")
for (col in categorical_columns) {
  # Create a bar plot
  p <- ggplot(data, aes_string(x = col)) +
    geom_bar(fill = "steelblue", color = "black") +
    geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, size = 3) +
    labs(
      title = paste("Bar Plot of", col),
      x = col,
      y = "Count"
    ) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels if needed
  
  # Print the plot
  print(p)
}

One thing to notice is that the majority of the subjects are white, and most of the subjects don’t have serious Adverse Event. Other than these, other categorical variables are balanced.

Now we want to take a look at the numerical variables We first want to know if there are outliers in each numerical variables

Code
numerical_columns <- c("LDH", "ALB", "HGB", "EXDOSE", "BMI", "ECOG", "CRP", "TOTAE", "SAECOUNT", "AGE")
for (col in numerical_columns) {
  p <- ggplot(data, aes_string(y = col)) +
    geom_boxplot(fill = "lightblue", color = "black", outlier.colour = "red") +
    labs(title = paste("Boxplot of", col), y = col) +
    theme_minimal()
  
  print(p)
}

Next we want to check the distribution of the numerical variables

Code
for (col in numerical_columns) {
  p <- ggplot(data, aes_string(x = col)) +
    geom_density(fill = "steelblue", alpha = 0.6) +
    labs(title = paste("Density Plot of", col), x = col, y = "Density") +
    theme_minimal()
  
  print(p)
}

LDH, ALB and HGB look similar, we can make a reasonable assumption that they are correlated. We can see that Age, MBI and CRP are all nearly normally distributed. For ECOG, TOTAE and SAECOUNT, since their values are all integer, their distribution plot will seem a little weird, but it is within our expectation.

Now we are curious about the statistical summary of these numerical columns.

Code
numerical_summary <- summary(data[numerical_columns])
print(numerical_summary)
      LDH               ALB               HGB              EXDOSE     
 Min.   :  3.383   Min.   :  2.986   Min.   :  2.793   Min.   : 10.0  
 1st Qu.:  4.868   1st Qu.:  4.900   1st Qu.:  4.855   1st Qu.: 50.0  
 Median : 12.855   Median : 13.049   Median : 12.953   Median :110.0  
 Mean   : 56.224   Mean   : 56.568   Mean   : 54.914   Mean   :105.1  
 3rd Qu.:137.719   3rd Qu.:138.617   3rd Qu.:133.987   3rd Qu.:150.0  
 Max.   :211.675   Max.   :194.763   Max.   :224.314   Max.   :200.0  
                                                                      
      BMI             ECOG            CRP             TOTAE      
 Min.   :10.77   Min.   :0.000   Min.   : 1.843   Min.   :1.000  
 1st Qu.:21.92   1st Qu.:1.000   1st Qu.: 7.784   1st Qu.:1.000  
 Median :25.60   Median :2.000   Median : 9.767   Median :2.000  
 Mean   :25.35   Mean   :2.059   Mean   : 9.807   Mean   :2.462  
 3rd Qu.:28.94   3rd Qu.:3.000   3rd Qu.:11.812   3rd Qu.:3.000  
 Max.   :42.26   Max.   :4.000   Max.   :18.781   Max.   :8.000  
                                                  NA's   :88     
    SAECOUNT           AGE       
 Min.   :0.0000   Min.   :57.35  
 1st Qu.:0.0000   1st Qu.:69.80  
 Median :0.0000   Median :73.25  
 Mean   :0.3496   Mean   :73.06  
 3rd Qu.:1.0000   3rd Qu.:76.42  
 Max.   :3.0000   Max.   :87.47  
 NA's   :88                      

Furthermore we are curious about the correlation between these numerical vairables

Code
library(reshape2)
library(tidyr)
# Calculate the correlation matrix for the numerical columns
cor_matrix <- cor(data[numerical_columns], use = "complete.obs", method = "pearson")
cor_matrix_melted <- melt(cor_matrix)

# Create the heatmap using ggplot2
ggplot(cor_matrix_melted, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(midpoint = 0, low = "red", high = "green", mid = "white") +
  theme_minimal() +
  labs(title = "Correlation Heatmap", x = "Variables", y = "Variables") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

We can see that only LDH ALB and HGB are correlated to each other, and we may consider choosing only one of these biomarker in our analysis

We now have a decent understanding of this data, now we can move to data preprocess and feature selections for our models.